Introduction or setting the stage


A major challenge in the analysis of many biological data matrices
stems from their shape: a relatively small number of records (samples),
often of the order of tens, versus thousands of attributes or
features for each record.

An obvious example, albeit a rather classical one today, is that of microarray
gene expression experiments (here, the features are genes or, more
precisely, their expression levels). Another, and a very specific one,
is that of analyzing molecular interaction networks underlying
HIV-1 resistance to reverse transcriptase inhibitors (here, the
features are some physicochemical properties of amino acids). In
Genome-Wide Association Studies, while we have thousands of
observations, each consists of hundreds of thousands of features.


Problems of this type are by no means confined to the Life Sciences.

Indeed, in our own work, we have encountered fascinating problems of
commercial origin, including transactional data from a major 
multinational FMCG (fast-moving consumer goods) company and 
geological data from oil wells operated by a major American oil 
company. 


Such tasks, regardless of whether the data are to explain a 
quantitative (as in regression) or categorical (as in classification) 
trait, are quite different from typical data mining problems, in 
which the number of features is much smaller than the number of 
samples. 

Indeed, in a sense, these are ill-posed problems. This is immediately
clear in the case of linear regression fitted by least squares: with more
features than samples, the normal equations have infinitely many solutions.


For two-class classification, at least from the geometrical point of
view, the task is trivial, since in a d-dimensional space, as many as
d + 1 points can be divided into two arbitrary and disjoint subsets
by some hyperplane, provided that these points do not lie in a
proper subspace of the d-dimensional space.


It is another matter that the hyperplane (or any other classification
rule) so found should be able to generalize.


In any case, whether in classification or in regression, since it is
the rule rather than the exception that most features in the data are
not informative, it is of utmost importance to select the few
that are informative and that may form the basis for class
prediction or for building a proper regression model.

That is, before building a classifier or a regression model, or while 
building any of them, we would like to find out which features are 
specifically linked to the problem at hand and should be included 
in the solution. 


Mathematically, properly formulated sparsity constraints should be 
included when seeking a solution. As we shall see, this requirement 
can be fulfilled by randomization or regularization. 


Regarding classification, one more important issue should be
emphasized:

More often than not, rather than obtaining the best possible
classifier, the Life Scientist needs to know which features
contribute most to classifying observations (samples) into distinct
classes and what the interdependencies are between the features
which describe the observations.


When dealing with multiple explanatory variables (features), one
needs to address the problem of hypothesis testing. We therefore
begin our exposition with a brief discussion of multiple hypothesis
testing.

We then turn, and confine ourselves, to the area of supervised
learning. Within the context of very high dimensional problems, in
particular the small n large p problems, it is reasonable to divide
the whole into three (more or less) separate families of approaches
to such learning:

• Monte Carlo methods

• Regularization approaches (with a penalty for model
complexity)

• Bayesian approaches.


Important remark: It should be emphasized that these three
families of approaches are not disjoint but partly
overlap. In particular, the penalty for model complexity can be
Bayesian (as with the Bayesian Information Criterion, BIC), which
amounts to Bayesian regularization. Moreover, it is of utmost interest, and
adds to their inherent beauty, that methods from different families
share, or have similar, mathematical foundations.
Multiple hypothesis testing


Univariate approach based on multiple hypothesis testing: while
disregarding interactions between features, it is statistically sound
and all too well illustrates the intricacy of the problem:

Assume a two-class classification case. For each k-th feature we
are interested in testing the null hypothesis $H_{0k}$ of no relationship
between the decision attribute (class) and the feature against the
alternative that such a relationship does exist.

For each k-th feature, k = 1, ..., d, a natural test statistic is a
t-statistic

\[
t_k = \frac{\bar{x}_{1k} - \bar{x}_{2k}}{\sqrt{s_{1k}^2 + s_{2k}^2}},
\]

although examined without assuming normal distribution of the
feature.

The real catch is that we have to perform not one but d such tests.
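
The per-feature statistic can be sketched in a few lines of Python. This is a minimal illustration only: it assumes the Welch form of the two-sample t-statistic, in which the two terms under the square root are the squared standard errors of the class means, and it assumes classes are coded as 1 and 2. The function names `welch_t` and `t_statistics` are hypothetical.

```python
import math
import statistics


def welch_t(class1, class2):
    """Two-sample t-statistic for one feature (Welch form, an assumption)."""
    n1, n2 = len(class1), len(class2)
    mean1, mean2 = statistics.fmean(class1), statistics.fmean(class2)
    # Squared standard errors of the two class means.
    se1_sq = statistics.variance(class1) / n1
    se2_sq = statistics.variance(class2) / n2
    return (mean1 - mean2) / math.sqrt(se1_sq + se2_sq)


def t_statistics(samples, labels):
    """One t-statistic per feature; `samples` is a list of feature vectors."""
    d = len(samples[0])
    stats = []
    for k in range(d):
        group1 = [row[k] for row, y in zip(samples, labels) if y == 1]
        group2 = [row[k] for row, y in zip(samples, labels) if y == 2]
        stats.append(welch_t(group1, group2))
    return stats
```

With d in the thousands, `t_statistics` returns thousands of values, each of which would have to be turned into a test decision; this is exactly where the multiplicity problem arises.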


The battery of tests should have a fixed level of the probability of
type I error, e.g.,

\[
\mathrm{FWER} = \text{family-wise error rate} = P(FP \ge 1) \le \alpha,
\]

where FP stands for the number of false positives (i.e., type I
errors),

or

\[
\mathrm{FDR} = \text{false discovery rate} = E\big(FP/(FP + TP)\big) \le \alpha,
\]

as well as a reasonable power of the whole procedure, e.g.,

\[
P(TP \ge 1),
\]

where TP stands for the number of true positives.


Bonferroni's (1936) classical procedure, under which any null
hypothesis is rejected at level $\alpha/d$, controls the FWER,

\[
\mathrm{FWER} = \text{family-wise error rate} = P(FP \ge 1) \le \alpha,
\]

for arbitrary joint null distributions of the test statistics; that is,

\[
P(FP \ge 1) \le \sum_{i \in H_0} P_{H_{0i}}(i\text{-th test rejects}) \le \frac{h}{d}\,\alpha \le \alpha,
\]

where i runs over the set $H_0$ of indices corresponding to true null
hypotheses and $h = |H_0|$.

(Under independence of test statistics and the complete null
hypothesis,

\[
\mathrm{FWER} = 1 - (1 - \alpha/d)^d;
\]

the FWER is smaller if they are positively dependent.)
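
As a small numerical illustration, the sketch below applies the Bonferroni threshold α/d to a list of p-values and evaluates the closed-form FWER under independence and the complete null hypothesis, 1 − (1 − α/d)^d. The function names are hypothetical and the p-values in the usage note are made up for illustration.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject the i-th null hypothesis iff its p-value is at most alpha / d."""
    d = len(p_values)
    return [p <= alpha / d for p in p_values]


def fwer_independent(alpha, d):
    """FWER of the Bonferroni procedure when all d nulls are true and the
    test statistics are independent: 1 - (1 - alpha/d)^d."""
    return 1.0 - (1.0 - alpha / d) ** d
```

For example, `bonferroni_reject([0.001, 0.02, 0.04])` uses the threshold 0.05/3, so only the first hypothesis is rejected, and `fwer_independent(0.05, 1000)` stays slightly below 0.05, in line with the bound above.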


Note that under the Bonferroni procedure any null hypothesis is
rejected regardless of the values of test statistics for other
hypotheses.

A more sophisticated procedure of Benjamini and Hochberg (1995;
see below) controls the FDR,


\[
\mathrm{FDR} = \text{false discovery rate} = E\big(FP/(FP + TP)\big) \le \alpha,
\]


for independent test statistics (or, more generally, for positively
regression dependent test statistics).


The Benjamini and Hochberg procedure:

1. Let

\[
p_{(1)} \le p_{(2)} \le \cdots \le p_{(d)}
\]

denote the observed ordered p-values.

2. Let

\[
L = \max\{\, j : p_{(j)} \le j\alpha/d \,\}.
\]

3. Reject all hypotheses $H_{0j}$ such that $p_j \le p_{(L)}$.

Thus, the p-values must be obtained, but this can be done by a
simple resampling procedure.

For this section see Dudoit and van der Laan (2008).
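
The three steps of the Benjamini and Hochberg step-up procedure can be sketched as follows. This is a minimal illustration that takes the p-values as given; it does not cover the resampling step used to obtain them. The function name is hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    d = len(p_values)
    # Step 1: indices of the hypotheses ordered by increasing p-value.
    order = sorted(range(d), key=lambda j: p_values[j])
    # Step 2: L = largest rank j with p_(j) <= j * alpha / d (0 if none).
    largest_rank = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= rank * alpha / d:
            largest_rank = rank
    # Step 3: reject every hypothesis whose p-value is at most p_(L).
    rejected = [False] * d
    for j in order[:largest_rank]:
        rejected[j] = True
    return rejected
```

Note the step-up character: with p-values 0.03, 0.04, 0.045, 0.049 and α = 0.05, only the largest one passes its own threshold, yet all four hypotheses are rejected because they all lie at or below p_(L).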
MC approaches: Model selection for linear regression -
Random Subspace Method (RSM)


Mielniczuk and Teisseyre (2011) and (2013): Let $T_{i,m}$ be a
t-statistic for the i-th predictor in a linear regression model m with
|m| predictors. We have:

\[
\frac{T_{i,m}^2}{n - |m|} = \frac{RSS_{m \setminus \{i\}} - RSS_m}{RSS_m}.
\]

It follows that the value of $T_{i,m}^2$ can serve as a measure of,
simultaneously, the importance of the i-th predictor in model m
and the quality of this very model.


In the RSM, a random subset m of features (predictors), of size
|m| smaller than both the number of all features d and the number of
observations n, is chosen. The model is fitted in the reduced
feature space by OLS. Each of the selected features is assigned a
weight describing its relevance in the considered submodel.

Random selection of features is repeated many times,
corresponding submodels are fitted and the final weights (scores)
of all d features are computed on the basis of all submodels.

The final model can then be constructed based on a predetermined
number of the most significant predictors or using a selection
method applied to the nested list of models given by the ordering
of predictors.
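
The scoring loop of the RSM can be sketched in plain Python as below. This is an illustration under stated assumptions, not the authors' implementation: models are fitted without an intercept (so that |m| is the number of fitted parameters), the weight of a feature in a submodel is taken to be T² computed from the RSS identity T²/(n − |m|) = (RSS of m without i − RSS of m)/RSS of m, and the final score is the average weight over the submodels containing the feature. The names `solve`, `rss` and `rsm_scores` are hypothetical.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rss(X, y, cols):
    """Residual sum of squares of the OLS fit of y on the columns `cols`."""
    if not cols:
        return sum(yi * yi for yi in y)
    rows = [[row[c] for c in cols] for row in X]
    p = len(cols)
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(bj * rj for bj, rj in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def rsm_scores(X, y, subset_size, n_subsets, seed=0):
    """Average squared t-statistic of each feature over random submodels."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    totals, counts = [0.0] * d, [0] * d
    for _ in range(n_subsets):
        m = rng.sample(range(d), subset_size)
        rss_m = rss(X, y, m)
        for i in m:
            rss_without_i = rss(X, y, [c for c in m if c != i])
            # T^2 recovered from the RSS identity.
            t_squared = (n - subset_size) * (rss_without_i - rss_m) / rss_m
            totals[i] += t_squared
            counts[i] += 1
    return [totals[i] / counts[i] if counts[i] else 0.0 for i in range(d)]
```

Sorting the features by the returned scores yields the ordering of predictors from which the nested list of candidate models is formed.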
MC approaches: MCFS-ID Algorithm of Draminski et al.:
the Monte Carlo Feature Selection (or MCFS) part


In what follows we begin with a brief description of an effective
method for ranking features according to their importance for
classification regardless of the classifier to be used later. Our
procedure is conceptually very simple, albeit computer-intensive.

We consider a particular feature to be important, or informative, if
it is likely to take part in the process of classifying samples into
classes "more often than not".

This "readiness" of a feature to take part in the classification
process, termed relative importance of a feature, is measured via
intensive use of classification trees. When assessing relative
importance of a feature, the aforementioned "readiness" of the
feature to appear in a given tree is suitably moderated by the
(weighted) accuracy of this tree.


In the main step of the procedure, we estimate relative importance
of features by constructing thousands of trees for randomly
selected subsets of features.

More precisely, out of all d features, s subsets of m features are
selected, m being fixed and m ≪ d, and for each subset of
features, t trees are constructed and their performance is assessed.
Each of the t trees in the inner loop is trained and evaluated on
different, randomly selected training and test sets which come from
a split of the full set of training data into two subsets: each time,
out of all n samples, about 66% of samples are drawn at random
for training (in such a way as to preserve proportions of classes
from the full set of training data) and the remaining samples are
used for testing.
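
The resampling scheme just described (s feature subsets, t stratified splits per subset) can be sketched as follows. This covers only the sampling plan; tree construction and the computation of relative importance are deliberately left out. The function names and the rounding rule used for the 66% cut are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.66, rng=None):
    """Split sample indices into train/test, preserving class proportions."""
    rng = rng or random.Random()
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        cut = round(train_fraction * len(shuffled))
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return sorted(train), sorted(test)


def mcfs_resampling_plan(d, m, s, t, labels, seed=0):
    """Yield (feature subset, train indices, test indices) for all s * t trees."""
    rng = random.Random(seed)
    for _ in range(s):
        features = sorted(rng.sample(range(d), m))  # m features out of d
        for _ in range(t):
            train, test = stratified_split(labels, 0.66, rng)
            yield features, train, test
```

Each yielded triple describes the data on which one tree would be trained and evaluated, so the plan contains s · t entries in total.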
MC approaches: Interdependency Discovery, i.e., the ID
part of the MCFS-ID Algorithm


In the MCFS part of the algorithm, a cutoff between informative 
and non-informative features is provided. From now on, our 
interest is confined to the set of informative features. 

This approach to interdependency discovery is significantly 
different from known approaches which consist in finding 
correlations between features or finding groups of features that 
behave similarly in some sense across samples (e.g., as in finding 
co-regulated features). 

The focus is on identifying features that "cooperate" in
determining that a sample belongs to a particular class. A directed
graph of such "cooperating" features is constructed.


For an exposition of the MCFS-ID algorithm in its full-fledged
versions, see Draminski et al. (2008), (2010), (2016a) and (2016b).

Regarding the ID part of the algorithm, see also the last section of 
this presentation. 
Regularization approaches: Model selection for linear
regression - $\ell_1$ regularization


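The Lasso (Least Absolute Shrinkage and Selection Operator) for linear models:

As usual, we are given n observations, each with d explanatory
variables (predictors), $(x_{i1}, x_{i2}, \ldots, x_{id})$, and one response variable,
$y_i$:

\[
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_d x_{id} + \varepsilon_i, \quad i = 1, 2, \ldots, n,
\]

where $\varepsilon_i$ are i.i.d. random errors with mean 0 and unknown
variance $\sigma^2$, and $\beta_0, \ldots, \beta_d$ are unknown parameters.

Minimize

\[
\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{d} \beta_j x_{ij} \Big)^2
\]

subject to

\[
\sum_{j=1}^{d} |\beta_j| \le t.
\]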



The Lasso, in contrast to ridge regression (i.e., $\ell_2$ regularization),
eliminates some variables from the model when t is small. It can thus
be used as a feature selection method, although one should be
aware that the method is likely to include too many variables.
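
How the Lasso sets coefficients exactly to zero can be seen in a small coordinate-descent sketch. Note the swap: instead of the constrained form with bound t, the code solves the equivalent penalized (Lagrangian) form, minimizing half the residual sum of squares plus lam times the sum of |β_j|; a larger `lam` corresponds to a smaller t. The function names and the fixed number of sweeps are assumptions made for illustration.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=100):
    """Coordinate descent for 0.5 * RSS + lam * sum_j |beta_j|.

    Columns and response are centred, so the intercept is not penalized.
    Returns (intercept, list of coefficients).
    """
    n, d = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - col_means[j] for j in range(d)] for row in X]
    beta = [0.0] * d
    resid = [yi - y_mean for yi in y]  # residuals for beta = 0
    for _ in range(n_sweeps):
        for j in range(d):
            col = [row[j] for row in Xc]
            norm = sum(v * v for v in col)
            if norm == 0.0:
                continue  # constant column carries no information
            # Inner product of column j with the residual that ignores feature j.
            rho = sum(v * (r + beta[j] * v) for v, r in zip(col, resid))
            new_beta = soft_threshold(rho, lam) / norm
            resid = [r - (new_beta - beta[j]) * v for v, r in zip(col, resid)]
            beta[j] = new_beta
    intercept = y_mean - sum(b * m for b, m in zip(beta, col_means))
    return intercept, beta
```

The soft-thresholding step is what distinguishes the Lasso from ridge regression: a ridge update only rescales each coefficient, whereas here any coefficient whose partial correlation with the residual is small enough is set exactly to zero.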


For an exhaustive account of the Lasso and related approaches see
Bühlmann and van de Geer (2011) and Hastie, Tibshirani and
Wainwright (2015). For an important extension of the idea see
Pokarowski and Mielniczuk (2015), where a three-stage algorithm
for selecting a regression model is proposed, with the Lasso used in
the 1st stage for screening of predictors (features). See also
Bogdan et al. (2015), where the regularizer is a sorted $\ell_1$ norm.
Regularization approaches: Support Vector Machines - $\ell_2$
regularization. And more


We skip an exposition of SVMs. Regarding their use for Big Data 
Analytics, we refer to Tan et al. (2014) and to Priyadarshini and 
Agarwal (2015). 

There are more statistical approaches to dealing with
high-dimensional data than those already hinted at and the
Bayesian ones. See Bühlmann and van de Geer (2011) for an
approach which stems from undirected graphical modeling and is
based on inferring zero partial correlations for variable selection
(the so-called PC-simple algorithm).

Yet another promising approach, which builds on ranking
the marginal correlations and is referred to as sure independence
screening, has been introduced by Fan and Lv (2008); see also Fan
and Song (2010).

Jacek Koronacki, Analiza danych o wielkim wymiarze (Analysis of High-Dimensional Data)



Model selection for linear regression - Bayesian approaches 


Broman and Speed (2002): Let

\[
y_i = \mu + \sum_{j=1}^{d} \beta_j x_{ij} + \varepsilon_i,
\]

where $x_{ij} = 1$ or $x_{ij} = 0$ and the $\varepsilon_i$ are i.i.d. and normally
distributed, $N(0, \sigma^2)$ (in fact, $x_{ij}$ represents genotype at marker j
for individual i). The task is to select a model for which Schwarz's
Bayesian Information Criterion (BIC) assumes the minimal value;

\[
\mathrm{BIC} = \frac{n}{2} \log RSS(\hat\beta) + \frac{1}{2} k \log n,
\]

where k is the number of parameters $\beta_j$ in the model. It was
observed by Broman and Speed that the BIC tends to overestimate
the number of parameters in the model. Accordingly, they
proposed the 1st modification of the BIC.
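
To make the trade-off in the criterion concrete, here is a minimal sketch that evaluates a BIC of the form (n/2) log RSS + (1/2) k log n. The exact scaling is an assumption about how the criterion is written on the slide; multiplying the whole expression by a positive constant does not change which model minimizes it. The function name and the numbers in the usage note are made up for illustration.

```python
import math


def bic(rss, n, k):
    """BIC for a normal linear model with k regression parameters
    (scaling assumed: 0.5 * n * log(RSS) + 0.5 * k * log(n))."""
    return 0.5 * n * math.log(rss) + 0.5 * k * math.log(n)
```

For instance, with n = 100, a model with k = 2 and RSS = 50 has a lower BIC than a model with k = 5 and RSS = 49.9: the tiny gain in fit does not pay for three extra parameters.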


The Bayesian model selection advocates choosing the model M
that maximizes the posterior probability of the model given the data,
this probability being proportional to

\[
L(y \mid M)\,\pi(M),
\]

where $\pi(M)$ is a prior probability for model M (Schwarz assumed
a noninformative uniform prior $\pi$), and

\[
L(y \mid M) = \int L(y \mid M, \beta)\, f(\beta \mid M)\, d\beta,
\]

$f(\beta \mid M)$ being some prior distribution on the vector of model
parameters; for a wide class of these distributions one gets

\[
\log L(y \mid M) \approx \log L(y \mid \hat\beta) - \frac{1}{2}(k + 2) \log n.
\]

For the family of normal linear regression models, maximization of
this last expression is equivalent to minimization of the BIC.


Bogdan et al. (2004) introduced another modification of BIC
(mBIC), assuming a binomial prior distribution, Bin(d, c/d), with
some fixed c, for the model size. See Bogdan et al. (2011) for later
developments and Frommlet et al. (2012) for application of their 
approach to Genome-Wide Association Studies. 

It is easy to extend the outlined approach to include regression
models with interactions. It is also possible to extend it to include
generalized linear models (possibly with constraints on the model's
parameters).

The outlined approach is by no means the only one possible among
this strand of Bayesian approaches; e.g., a similar approach is that
based on the extended BIC, and a completely different approach, 
which bears some relationship with support vector machines, is 
that of relevance vector machines. (See, e.g., Chen and Chen 
(2008) and (2011), and Tipping (2001), Fletcher (2010) and 
Saarela et al. (2010).) 
Nonparametric Bayesian approaches


Let Y be a response and $X = (X^{(1)}, \ldots, X^{(p)}) \in \mathbb{R}^p$ be explanatory
variables. Assume

\[
Y = f(X) + \varepsilon,
\]

with $\varepsilon$ normally distributed, $N(0, \sigma^2)$.

Usually, a Gaussian Process (GP) prior for f is assumed to have
zero mean and squared exponential covariance function (kernel
function) $\exp(-\|x - x'\|^2 / c)$. Such processes are smooth in a
well-known sense. Other kernels can be used, and other
smoothness conditions on f can be imposed.
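
The squared exponential covariance function is easy to write down explicitly. The sketch below evaluates exp(−‖x − x'‖²/c) for two points and builds the covariance matrix of the GP prior at a finite set of inputs; the function names and the default value of the scale c are assumptions made for illustration.

```python
import math


def sq_exp_kernel(x, x_prime, c=1.0):
    """Squared exponential kernel exp(-||x - x'||^2 / c)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-sq_dist / c)


def covariance_matrix(points, c=1.0):
    """Covariance matrix of a zero-mean GP prior at the given input points."""
    return [[sq_exp_kernel(p, q, c) for q in points] for p in points]
```

Nearby inputs receive covariance close to 1 and distant inputs covariance close to 0, which is what makes sample paths of such a process smooth; the scale c controls how quickly the covariance decays.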

It should be emphasized that the above mentioned use of a kernel 
function casts the whole approach into the area of ML with kernels 
(kernel machines). Indeed, some far reaching similarities (and 
differences) with ridge regression, SVMs, as well as with spline 
models are obvious and deserve separate analysis. 


An excellent exposition of Gaussian processes for ML is given in 
Rasmussen and Williams (2006); another excellent, albeit short, 
introduction to GPs in ML can be found in Bishop (2006). In 
neither of these expositions problems pertaining to dealing with 
Big Data are addressed, although Rasmussen and Williams (2006) 
has a chapter on Approximation Methods for Large Datasets. 


However interesting GPs for ML are, within the context of Big
Data Analytics, special emphasis has to be placed on variable
selection and/or variable projections. Loosely speaking, such
mechanisms can be included into the nonparametric Bayesian
approach by adding more randomness into the process, i.e.,
introducing suitable hyperparameters. See Tokdar (2011) for
variable selection and linear projection proposals which have been
shown to give consistent (in probability, and at a known rate)
estimators of an unknown f; e.g., for f depending on d < p
variables, the rate of convergence is

\[
n^{-\alpha/(2\alpha + d)} (\log n)^k
\]

for any k > p + 1.

Yang (2014) has noticed that Tokdar's proposal can be considered
effective only if d ≪ p.


Yang (2014) has provided a general framework to assess the
minimax risks for regression problems under $\ell_2$ loss (see there for
an excellent account of earlier, sometimes pioneering, results in the
area). He has introduced a general class of Bayesian sieve
estimators which, under certain (more or less restrictive)
conditions, achieve the optimal minimax risk when f depends on
d ≪ min{n, p} variables or is a sum of finitely many, k,
functions, each of which depends on $d_s$ ≪ min{n, p} variables.

He has shown also that a GP regression approach can lead to the 
minimax optimal adaptive ratę in estimating f under some 
conditions when the function’s domain lies on a Riemannian 
manifold. 

See also Yang and Dunson (2014) and Yang and Tokdar (2015). 
A word on Big Data Analytics from a general perspective


It now seems widely accepted that the term Big Data refers to
situations in which data are characterized by at least three or four
"Vs" (cf., e.g., chapter 1 in Japkowicz and Stefanowski (2016)):

• Volume - huge and usually continuously increasing size of
the collected and analyzed data

• Velocity - high speed at which the data are generated and input
into an analyzing system

• Variety - heterogeneous and complex representations of the
analyzed data

• Variability - changes in the structure of the data, as well as
changes in how users want to interpret the data.

Clearly then, strictly speaking, Massive Data should not be
confused with Big Data.
A word on Big Data Analytics from a statistical perspective


• Statistical approaches form an indispensable and crucial part
of Machine Learning

• As of now, while statistical meta-analyses as well as, e.g.,
probabilistic methods of linking data from different sources are
studied and developed, statistical approaches are best suited
to deal only with Massive Data from a homogeneous source

• As such, statistical approaches form an indispensable and
crucial part of Big Data Analytics, provided they are used within
homogeneous settings

• The importance of statistical approaches follows from their
explanatory power and methodological rigor


• A statistician is well aware that he/she can apply statistical
techniques only when the data come from repetitions of some
events. He/she is also well aware that the data at hand, when
properly analyzed, can help answer only some specific
questions, by no means every question of interest. He/she is well
equipped to examine data for possible biases or other faults.

• Methods of statistical learning provide causal models when
possible (feasible), and predictive algorithms (behavioral
models) when deeper cognizance of the phenomenon under
scrutiny is unavailable.


• Paradoxically, it was the extraordinary development of
computer technologies that freed statisticians dealing with
massive data from John Tukey's prison of Exploratory Data
Analysis with its slogan "Let the data speak for themselves"

• In 1979, William Eddy, a statistician not as famous as John Tukey
but more radical, proclaimed:

"The data analytic method denies the existence of 'truth', the
only knowledge is empirical.

[...] If we can make without models, I think we should."

• Today, a nonmilitant statistician prefers to say:

If we cannot make with models, we should make without
them.


• Flooded by Big Data, some researchers claim essentially the
same as what radical proponents of EDA claimed decades ago.
They say that, e.g., given Big Data, we can abandon causal
explanations, since it suffices to know correlations which
enable one to predict; cf. discussions of this issue in chapters
1 and 2 in Japkowicz and Stefanowski (2016).

• Even if any pretext can serve the purpose of regressing to
foolishness, it is better to stay wise and try to understand, not
only to predict.

• Happily, the earlier discussed methods of statistical learning
are used and developed to advantage, and widely, within the
Big Data settings.
In lieu of a conclusion - more on the ID part of the
MCFS-ID Algorithm


For a given training set of samples, an ensemble of decision trees
has been constructed within the MCFS part of the algorithm. Each
decision rule provided by each tree has the form of an "ordered
conjunction" of conditions imposed on particular separate features.

(Note that trees are "flexible" classifiers, where flexibility amounts
to a classifier's ability to produce rules as complex as is needed.)

Clearly then, each decision rule points to some interdependencies
between the features appearing in the conditions. Indeed, the
information included in such decision rules, when properly
aggregated, reveals interdependencies (however complex they may
prove) between features which are best "correlated" with or, as has
been said, "cooperate" in determining, the samples' classes.


To see how an ID-Graph is built, let us recall again that each node
in each of the multitude of classification trees represents a feature
on which a split is made. Now, for each node in each classification
tree, all its antecedent nodes can be taken into account along the
path to which the node belongs.

For each pair [antecedent node → given node] we add one directed
edge to our ID-Graph from the antecedent node to the given node.

The edges are found along the paths in all the s · t MCFS trees.
Clearly, the same edge can appear more than once even in a single
tree.


The strength of the interdependence between two nodes, actually
two features, connected by a directed edge, termed ID weight of a
given edge (ID weight for short), is defined in the following way:


For node $n_k(\tau)$ in the $\tau$-th tree, $\tau = 1, \ldots, s \cdot t$, and its antecedent
node $n_i(\tau)$, the ID weight of the directed edge from $n_i(\tau)$ to $n_k(\tau)$,
denoted $w[n_i(\tau) \to n_k(\tau)]$, is equal to

\[
w[n_i(\tau) \to n_k(\tau)] = GR(n_k(\tau)) \left( \frac{\text{no. in } n_k(\tau)}{\text{no. in } n_i(\tau)} \right), \qquad (1)
\]

where $GR(n_k(\tau))$ stands for the gain ratio for node $n_k(\tau)$,
(no. in $n_k(\tau)$) denotes the number of samples in node $n_k(\tau)$ and
(no. in $n_i(\tau)$) denotes the number of samples in node $n_i(\tau)$.


The final ID-Graph is based on the sums of all ID weights for each
pair [antecedent node → given node].

That is, for each directed edge found, its ID weights are summed
over all occurrences of this edge in all paths of all MCFS
classification trees.

For a given edge, it is this sum of ID weights which becomes the
ID weight of this edge in the final ID-Graph.
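
The construction of the ID-Graph can be sketched as a walk over the trees. In the illustration below each split node is a plain dictionary with keys `feature`, `gain_ratio`, `n_samples` and `children` (a hypothetical representation, not the data structure of any MCFS-ID implementation); for every node and every antecedent on its path, the weight gain ratio of the node times (samples in the node / samples in the antecedent) is added to the edge between the two features.

```python
from collections import defaultdict


def add_tree_edges(node, ancestors, graph):
    """Add the ID weights contributed by `node` and its subtree to `graph`."""
    for ancestor in ancestors:
        weight = node['gain_ratio'] * node['n_samples'] / ancestor['n_samples']
        graph[(ancestor['feature'], node['feature'])] += weight
    for child in node.get('children', []):
        add_tree_edges(child, ancestors + [node], graph)


def build_id_graph(trees):
    """Sum ID weights over all paths of all trees; keys are (from, to) pairs."""
    graph = defaultdict(float)
    for root in trees:
        add_tree_edges(root, [], graph)
    return dict(graph)
```

Because weights are accumulated per ordered pair of features, an edge that occurs several times, even within a single tree, simply contributes several terms to the same sum.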


In sum, an ID-Graph provides a general roadmap that not only
shows all the most variable attributes that allow for efficient 
classification of the objects but, moreover, it points to possible 
interdependencies between the attributes and, in particular, to a 
hierarchy between pairs of attributes. High differentiation of the 
values of ID weights in the ID-Graph gives strong evidence that 
some interdependencies between some features are much stronger 
than others and that they create some patterns/paths calling for 
interpretation based on background knowledge. 






The ID part of the MCFS-ID Algorithm - a toy example 


Consider objects from 3 classes, A, B and C, that contain 40, 20
and 10 objects, respectively (70 objects altogether). For each
object, create 6 binary features (A1, A2, B1, B2, C1 and C2) that
are 'ideally' or 'almost ideally' correlated with the class feature. If an
object's 'class' equals 'A', then its features A1 and A2 are set to
class value 'A'; otherwise A1 = A2 = 0. If an object's 'class' is 'B'
or 'C', we proceed analogously, but we introduce some random
corruption to 2 observations from class 'B' and to 4 observations
from class 'C': in the former case, for each of the two observations
and both attributes B1/B2, we randomly replace their value 'B' by
'0' and in the latter case, again for each of the four observations
and both attributes C1/C2, we randomly replace their value 'C' by
'0'. The data also contain an additional 500 random numerical
features with values uniformly distributed on [0,1]. Thus we end up
with 6 nominal important features (3 pairs with different levels of
importance for classification) and 500 random ones.
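
A data set of this kind can be generated with a few lines of Python. The sketch below is one possible reading of the description, not the script used for the slides: in particular, "randomly replace" is interpreted as replacing each of the two attributes of a corrupted observation independently with probability 0.5, and the noise features are named `noise0`, `noise1`, and so on. Both choices, like the function name, are assumptions.

```python
import random


def make_toy_data(seed=0, n_noise=500):
    """Return a list of 70 dict rows: class, A1..C2 and n_noise uniform features."""
    rng = random.Random(seed)
    sizes = {'A': 40, 'B': 20, 'C': 10}
    n_corrupted = {'A': 0, 'B': 2, 'C': 4}
    rows = []
    for cls, size in sizes.items():
        # Observations of this class whose own pair of features may be corrupted.
        corrupted = set(rng.sample(range(size), n_corrupted[cls]))
        for i in range(size):
            row = {'class': cls}
            for other in sizes:
                for suffix in ('1', '2'):
                    value = other if other == cls else '0'
                    # Assumed corruption rule: flip to '0' with probability 0.5.
                    if other == cls and i in corrupted and rng.random() < 0.5:
                        value = '0'
                    row[other + suffix] = value
            for j in range(n_noise):
                row[f'noise{j}'] = rng.random()
            rows.append(row)
    return rows
```

By construction, A1 and A2 identify class A perfectly, B1 and B2 are slightly degraded, and C1 and C2 are degraded the most, which gives the three pairs their different levels of importance.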
Figure: Top features (RI_norm) selected by MCFS-ID.
Figure: ID-Graph for artificial data.


In the ID-Graphs, as seen in the Figure, some additional
information is conveyed with the help of suitable graphical means.
The color intensity of a node is proportional to the corresponding
feature's RI. The size of a node is proportional to the number of
edges related to this node. The width and level of darkness of an
edge are proportional to the ID weight of this edge. Since we would
like to review only the strongest ID weights, let us plot the ID-Graph
with only the 12 top edges.
Figure: ID-Graph for artificial data, limited to top 6 features and top 12
ID weights.
The ID part of the MCFS-ID Algorithm, and more -
Discovering interactions on a finer level


The ID-Graph does not tell the differences between the classes, 
i.e., it does tell what interdependencies make the samples belong 
to different classes but does not give rules which determine any 
given class. Accordingly and separately, a way to construct rule 
networks is also provided, where the networks are constructed from 
IF-THEN rules with one network per each decision class. 

Please see Bornelöv, Marillet and Komorowski (2014) and
Draminski et al. (2016a) for our proposal.

Concluding, let us add that while the current version of the 
MCFS-ID is a new one, it is already included in CRAN (The 
Comprehensive R Archive Network). Moreover, along with a 
module to discover rule networks their explanatory power has been 
verified on a number of molecular and medical examples. 